ReCache: Reactive Caching for Fast Analytics over Heterogeneous Data
Authors: Tahir Azim, Manos Karpathiotakis, Anastasia Ailamaki
Abstract
As data continues to be generated at exponentially growing rates in heterogeneous formats, fast analytics to extract meaningful information is becoming increasingly important. Systems widely use in-memory caching as one of their primary techniques to speed up data analytics. However, caches in data analytics systems cannot rely on simple caching policies and a fixed data layout to achieve good performance. Different datasets and workloads require different layouts and policies to achieve optimal performance. This paper presents ReCache, a cache-based performance accelerator that is reactive to the cost and heterogeneity of diverse raw data formats. Using timing measurements of caching operations and selection operators in a query plan, ReCache accounts for the widely varying costs of reading, parsing, and caching data in nested and tabular formats. Combining these measurements with information about frequently accessed data fields in the workload, ReCache automatically decides whether a nested or relational column-oriented layout would lead to better query performance. Furthermore, ReCache keeps track of commonly utilized operators to make informed cache admission and eviction decisions. Experiments on synthetic and real-world datasets show that our caching techniques decrease caching overhead for individual queries by an average of 59%. Furthermore, over the entire workload, ReCache reduces execution time by 19-75% compared to existing techniques.
PVLDB Reference Format: Tahir Azim, Manos Karpathiotakis and Anastasia Ailamaki. ReCache: Reactive Caching for Fast Analytics over Heterogeneous Data. PVLDB, 11(3): xxxx-yyyy, 2017. DOI: 10.14778/3157794.3157801
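To make the abstract's idea of cost-aware admission and eviction concrete, here is a minimal Python sketch. It is not ReCache's actual algorithm, which the abstract does not spell out. The names `CacheEntry`, `CostAwareCache`, and `benefit_per_byte`, and the scoring formula itself, are assumptions chosen for illustration: each entry records the measured time it took to read, parse, and cache the data, and the cache evicts the entry that saves the least time per byte.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    key: str
    size_bytes: int
    build_seconds: float  # measured time to read, parse, and cache the data
    hits: int = 0

    def benefit_per_byte(self) -> float:
        # Hypothetical score: time saved on reuse, weighted by observed
        # reuse count, per byte of cache space occupied.
        return self.build_seconds * (self.hits + 1) / self.size_bytes


class CostAwareCache:
    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.entries = {}  # key -> CacheEntry

    def used_bytes(self) -> int:
        return sum(e.size_bytes for e in self.entries.values())

    def get(self, key: str):
        entry = self.entries.get(key)
        if entry is not None:
            entry.hits += 1  # track reuse so popular entries score higher
        return entry

    def admit(self, entry: CacheEntry) -> bool:
        # Reject entries that can never fit.
        if entry.size_bytes > self.capacity_bytes:
            return False
        self.entries.pop(entry.key, None)  # replace any stale copy
        # Evict the lowest-scoring entries until the new one fits.
        while self.used_bytes() + entry.size_bytes > self.capacity_bytes:
            victim = min(self.entries.values(), key=CacheEntry.benefit_per_byte)
            del self.entries[victim.key]
        self.entries[entry.key] = entry
        return True
```

Under this scoring, a small entry that was expensive to parse (for example, a projection of a nested file) outlives a large entry that was cheap to rebuild, unlike a plain LRU policy, which ignores construction cost.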
Similar resources
Neutrino: Revisiting Memory Caching for Iterative Data Analytics
In-memory analytics frameworks such as Apache Spark are rapidly gaining popularity as they provide order of magnitude performance speedup over disk-based systems for iterative workloads. For example, Spark uses the Resilient Distributed Dataset (RDD) abstraction to cache data in memory and iteratively compute on it in a distributed cluster. In this paper, we make the case that existing abtracti...
Distributed Caching for Complex Querying of Raw Arrays
As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format—without loading and partitioning. P...
Big Data Quality: From Content to Context
Over the last 20 years, and particularly with the advent of Big Data and analytics, the research area around Data and Information Quality (DIQ) is still a fast growing research area. There are many views and streams in DIQ research, generally aiming at improving the effectiveness of decision making in organizations. Although there are a lot of researches aimed at clarifying the role of BIG data...
Big Data Analytics in Bioinformatics: A Machine Learning Perspective
Bioinformatics research is characterized by voluminous and incremental datasets and complex data analytics methods. The machine learning methods used in bioinformatics are iterative and parallel. These methods can be scaled to handle big data using the distributed and parallel computing technologies. Usually big data tools perform computation in batch-mode and are not optimized for iterative pr...
Performance of Web Proxy Caching in Heterogeneous Bandwidth Environments
Much work on the performance of Web proxy caching has focused on high-level metrics such as hit rates, but has ignored low-level details such as cookies, aborted connections, and persistent connections between clients and proxies as well as between proxies and servers. These details have a strong impact on performance, particularly in heterogeneous bandwidth environments where network speeds be...
Journal: PVLDB
Volume 11, Issue 3
Pages: -
Publication date: 2017